Abstract

Background: Big Data is a relatively new field of research and
technology, and literature reports a wide variety of concepts
labeled with Big Data. The maturity of a research field can be
measured in the number of publications containing empirical
results. In this paper we present the current status of empirical
research in Big Data. Method: We employed a systematic mapping
method with which we mapped the collected research according to
the labels Variety, Volume and Velocity. In addition, we addressed
the application areas of Big Data. Results: We found that 151 of
the assessed 1778 contributions contain a form of empirical result
and can be mapped to one or more of the 3 V’s and 59 address an
application area. Conclusions: The share of publications
containing empirical results is well below the average compared to
computer science research as a whole. In order to mature the
research on Big Data, we recommend applying empirical methods to
strengthen the confidence in the reported results. Based on our
trend analysis we consider Volume and Variety to be the most
promising uncharted area in Big Data.
1 Introduction

A sharp increase in the number of publications related to the Big
Data field in the past years makes it difficult to get a good
overview of the current state of the art. Big Data technology is
diverse and can be applied to many areas. Big Data features in
many trend reports and academic publications. In order to get an
overview of the field, we have performed a systematic mapping
study and assessed to which degree empirical results have been
reported. In our study, empirical results mean that a technology
or concept has been tested and evaluated so that the result can be
seen as a part of an evidence base. Concepts or technology that
are merely referred to and not tested or evaluated are excluded
from this study. Generally speaking, Big Data is a collection of
large data sets with a great diversity of types so that it becomes
difficult to process by using state-of-the-art data processing
approaches or traditional data processing platforms [156]. In a
2011 Gartner report [104], Doug Laney explains the concept of
Volume, Variety and Velocity in data management. These are known
as the 3V’s and characterize the concept of Big Data. In addition
to these 3 fundamental V’s, many other V’s have emerged, though
these differ according to the special feature the authors of these
publications happen to need.

In 2012, Gartner revised and gave a more detailed definition^ as:
Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization. More generally, a data set can be called Big Data if
it is formidable to perform capture, curation, analysis and
visualization on it with current technologies.


^ http://www.gartner.com/resld=2057415 
NIST [50] defines Big Data as: Big data consists of advanced
techniques that harness independent resources for building
scalable data systems when the characteristics of the datasets
require new architectures for efficient storage, manipulation, and
analysis.

All agree on the fact that Big Data needs to be big, and in order
to be assessed as Big Data, one needs to address at least one of
the aspects of volume, velocity or variety. However, when one
looks into the literature, one finds quite quickly that
publications that through their title, keywords or abstract give
the impression of dealing with Big Data in fact do not address
these aspects. Sjøberg et al. [168] state that empirical research
seeks to explore, describe, predict, and explain natural, social,
or cognitive phenomena by using evidence based on observation or
experience. It involves obtaining and interpreting evidence by,
e.g., experimentation, systematic observation, interviews or
surveys, or by the careful examination of documents or artifacts.
Work done in an empirical manner can be used as an evidence base
for further research. In order to separate the wheat from the
chaff, we conducted a systematic mapping study taking into account
only publications that provide empirical results or address 3 V
aspects of Big Data.

1.1 Study approach and contribution 

During our mapping of the Big Data literature we found no
systematic review of empirical work carried out in the field of
Big Data. We did identify related studies and describe these in
Section 4.1. In order to create an overview of the areas that are
addressed, this paper describes how we carried out a systematic
literature mapping with a method similar to [124] to map existing
Big Data literature with a form of empirical evidence to the three
V’s of Big Data as well as application areas. The method is
described in detail in Section 2. We chose to limit the mapping to
the 3 V’s as these are the fundamental issues for Big Data. Many
other V-terms have been defined later, though none of these are
used consistently, which limits the mapping possibilities.

The main contributions of our study are:

— Systematic mapping of the findings of empirical Big 
Data Studies 

— Identification of Big Data studies containing empir¬ 
ical evidence 

— Overview of application areas of empirical Big Data 
Studies 

— Trend analysis of empirical Big Data Research 

— Identification of and discussion about related sur¬ 
veys 

A summary of some of our conclusions: 


— The number of reports on Big Data is rising, both
empirical and non-empirical

— Roughly 10 percent of the contributions on Big Data 
include empirical results 

— Application areas have been getting more attention 
over time 

— We recommend applying empirical methods to 
strengthen the confidence in the reported results 

— Based on our trend analysis we consider Variety to 
be the most promising uncharted area in Big Data 


1.2 Structure of this paper 

The paper is structured as follows. Section 2 describes the
general method employed in the work presented in this paper as
well as our specific implementation of this method. The results of
the different stages of this work are presented in Section 3.
Section 4 presents an analysis of the results and Section 5
discusses the limitations of our study. Finally, the last section
describes our conclusion and our recommendations for further
research.


2 The systematic mapping process 

The systematic mapping process is an iterative process where each
step builds upon the previous one. The process starts with a
research question and ends with a systematic map, see Figure 1. We
have based our systematic mapping procedure on Kai Petersen’s and
Robert Feldt’s work [155]. In this section we first highlight each
step (in a text box) of the systematic mapping process as
described by Petersen and Feldt; each step in the process
description is then followed by a description of how we
implemented it.


[Figure 1. Process steps: Definition of Research Question, Conduct
Search, Screening of Papers, Keywording using Abstracts, Data
Extraction and Mapping Process. Outcomes: Review Scope, All
Papers, Relevant Papers, Classification Scheme, Systematic Map.]

Fig. 1: The stages of the systematic mapping process [155].
2.1 Definition of Research Questions (Research Scope)

The main goal of a systematic mapping study is to
provide an overview of a research area, and identify
the quantity and type of research and results available
within it. Often one wants to map the frequencies
of publication over time to see trends. A secondary
goal can be to identify the forums in which
research in the area has been published.

We have defined the following research questions: 

Research Question 1: Have mapping studies 
with similar goals to ours been carried out? 

Research Question 2: What is the share of 
studies that ground their results with empirical 
methods? 

Research Question 3: How are studies that 
provide empirical results grouped according to the 
“three Vs”? And what is the distribution of these 
different groups? 

Research Question 4: What are the application
areas of Big Data and how are they distributed?

Research Question 5: Which publication outlets
are most prominent?

Research Question 6: Can we identify any 
trends within Empirical Big Data Research? 

2.2 Conduct Search 

The primary studies are identified by using search
strings on scientific databases or browsing manually
through relevant conference proceedings or journal
publications. A good way to create the search string
is to structure it in terms of population, intervention,
comparison, and outcome [98]. The structure should
of course be driven by the research questions.
Keywords for the search string can be taken
from each aspect of the structure. For example, the
outcome of a study (e.g., accuracy of an estimation
method) could lead to keywords like “case study” or
“experiment”, which are research approaches to
determine this accuracy.

In this study we used Elsevier’s Scopus^ for our search. Scopus
delivers the most comprehensive overview of the world’s research
output in the fields of science, technology, medicine, social
sciences and arts and humanities. It claims to be the largest
abstract and citation database of peer-reviewed literature and
within our domain it is a valid choice. Test searches done within
other databases all returned subsets of the result from Scopus
(e.g. the results from “(Big and Data) and (PublishedAs:journal)
and (Keywords:Big AND Keywords:Data)”, limited to 2014 and
earlier, in the ACM Digital Library were all part of the Scopus
results).

^ http://www.scopus.com

We searched for “Big Data” in the title, abstract and 
keywords. We included only papers that are accepted 
in journals, or in press for journals as well as reviews. 

This resulted in the following search string: 

TITLE-ABS-KEY("Big Data") AND DOCTYPE(ar OR re)
AND PUBYEAR < 2015
AND (LIMIT-TO(LANGUAGE,"English"))
AND (LIMIT-TO(SRCTYPE,"j")
OR LIMIT-TO(SRCTYPE,"k"))

The string above is written in the Scopus search query language,
which can be accessed at Scopus^, where “DOCTYPE(ar OR re)” limits
to article or review, “PUBYEAR < 2015” limits to publications from
before 2015, “(LIMIT-TO(LANGUAGE,"English"))” limits to
publications with English as the source language and
“LIMIT-TO(SRCTYPE,"j") OR LIMIT-TO(SRCTYPE,"k")” limits to
journals and book series.
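The way the query is composed from its individual limits can be sketched in a few lines of Python. This is only an illustration of the string construction; the function name `build_scopus_query` and its parameters are hypothetical and are not part of Scopus or of the original study.

```python
def build_scopus_query(phrase, doc_types, before_year, language, source_types):
    """Assemble an advanced-search string in the form used in this study.

    All names here are illustrative; only the resulting string follows
    the query shown in the text.
    """
    doc_clause = " OR ".join(doc_types)
    src_clause = " OR ".join(f'LIMIT-TO(SRCTYPE,"{s}")' for s in source_types)
    return (
        f'TITLE-ABS-KEY("{phrase}") AND DOCTYPE({doc_clause}) '
        f"AND PUBYEAR < {before_year} "
        f'AND (LIMIT-TO(LANGUAGE,"{language}")) '
        f"AND ({src_clause})"
    )


# Articles and reviews, before 2015, English, journals and book series.
query = build_scopus_query("Big Data", ["ar", "re"], 2015, "English", ["j", "k"])
print(query)
```

Keeping each limit as a separate argument makes it easy to rerun the same search with, for example, a different cut-off year.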

At the time of the query this also returned three publications
dated 2015, in direct conflict with the query parameters. This may
be because some journals predate an article (indexing it when it
is accepted, but before it is actually published). We have
included these papers in the data for completeness. These
publications were later excluded in our selection process and thus
are not included in any of the analyses, except for the graphs
presenting the total number of publications and any calculation or
analysis that depends on the total number of publications.


2.3 Screening of Papers for Inclusion and Exclusion 
(Relevant Papers) 

Inclusion and exclusion criteria are used to exclude 
studies that are not relevant to answer the research 
questions. The criteria show that the research 
questions influenced the inclusion and exclusion 
criteria. It is useful to prototype inclusion and 
exclusion parameters with a limited set of papers. 


^ http://www.scopus.com/search/form.url?display=advanced
Rather than having specific inclusion criteria other
than the selection by the search query described above,
our method includes every paper from the base corpus
until excluded. Exclusion criteria are used to exclude
studies that are not relevant to answer the research
questions. One may however regard the inverse of
criteria 3 and 4 as inclusion criteria. See Table 1 for
the criteria.

Given the inclusion and exclusion criteria used in
the study described in the process description, we
defined criteria suitable for our dataset. Out of the
full dataset, we chose 100 papers to test the
inclusion and exclusion parameters, prior to reading the
full set of papers. The selection was based on the
numbering in Endnote: we simply took approximately
every 15th paper and included it in our test set.
The first exclusion criterion is “no abstract” (some of
the results appeared to be short papers in magazines,
and these typically do not include abstracts): if
there is no abstract, we simply cannot see whether the
publication is relevant or not. The second exclusion
criterion is “Source language other than English”. Some
abstracts were written in a way that we typically see
when a machine translation is applied, which caused
doubt regarding the actual content of the contribution.
Upon further investigation of the metadata, we noticed
that some contributions do not have English as a
source language. This makes us unable to do further
investigation when needed and therefore we chose to
exclude these articles. Once a contribution passed
the first two criteria, we looked into the content. Only
clear contributions to Big Data are of interest for
this mapping study and therefore we exclude abstracts
that do not clearly define the contribution of the work
(criterion 3), as well as abstracts that are clearly not
related to the modern term Big Data (criterion 4; e.g.
[?] talks about “Big Data reduction”, however it is
clearly not related to the modern term “Big Data”).
Publications with very small data sets that claim that
the solution will work on a huge dataset without having
a convincing strategy for this are also excluded by
criterion 4. See also Table 1.
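Taking every 15th entry from a numbered list is a form of systematic sampling. A minimal sketch follows, assuming the papers are simply numbered 1 to 1749 and that the first 100 hits are kept; the exact procedure used in Endnote is not specified in the text, so the function name and the cut-off rule are assumptions.

```python
def systematic_sample(items, step, limit):
    """Take every `step`-th item (counting from 1) and keep at most `limit` items."""
    return items[step - 1::step][:limit]


# Hypothetical numbering of the unique publications (1..1749).
paper_ids = list(range(1, 1750))
pilot_set = systematic_sample(paper_ids, step=15, limit=100)

print(len(pilot_set), pilot_set[:3])
```

Note that such a sample depends on the ordering of the list, so it is only as unbiased as the numbering itself.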


Exclusion criteria number   Criteria
1                           No abstract.
2                           Source language other than English.
3                           Abstract does not clearly define contribution of work.
4                           Clearly not related to Big Data.

Table 1: Exclusion criteria
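The ordered application of the four criteria can be sketched as a small screening function. This is an illustration only: the `Paper` record and its field names are hypothetical, and criteria 3 and 4 are reviewer judgements that are merely stored as flags here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Paper:
    """Hypothetical record for one search result."""
    abstract: str
    source_language: str
    states_contribution: bool   # reviewer judgement for criterion 3
    related_to_big_data: bool   # reviewer judgement for criterion 4


def first_failed_criterion(paper: Paper) -> Optional[int]:
    """Return the number of the first exclusion criterion that applies, or None."""
    if not paper.abstract.strip():
        return 1  # no abstract
    if paper.source_language.lower() != "english":
        return 2  # source language other than English
    if not paper.states_contribution:
        return 3  # abstract does not clearly define contribution
    if not paper.related_to_big_data:
        return 4  # clearly not related to Big Data
    return None   # paper stays in the corpus


kept = Paper("We benchmark a MapReduce join on 2 TB.", "English", True, True)
no_abstract = Paper("", "English", True, True)
print(first_failed_criterion(kept), first_failed_criterion(no_abstract))
```

Returning the first failed criterion, rather than a plain yes/no, keeps a record of why each paper was excluded.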


2.4 Keywording of Abstracts (Classification Scheme)

Keywording is a way to reduce the time needed in 
developing the classification scheme and ensuring 
that the scheme takes the existing studies into 
account. 

We adopted the systematic process for classification
from [?]. However, instead of searching for keywords
to base the cluster map on, in our case the keywords
were already defined by Laney as explained in the
introduction. In addition to the 3 V’s we also defined
“application area” as a keyword to map to. As for the
mapping to empirical keywords, we extracted a list of
empirical method keywords, which has been compiled
in a matrix showing co-occurrences in Table 3.



Fig. 2: The three stages of the systematic mapping. 


2.5 Data Extraction and Mapping of Studies 
(Systematic Map) 

When having the classification scheme in place, the 
relevant articles are sorted into the scheme, i.e., the 
actual data extraction takes place. The classification 
scheme evolves while doing the data extraction, 
like adding new categories or merging and splitting 
existing categories. A scheme, for example in Excel, 
should be used to document the data extraction 
process. The table should contain each category 
of the classification scheme. When the reviewers 
enter the data of a paper into the scheme, they 
provide a short rationale why the paper should be 
in a certain category (for example, why the paper 
applied evaluation research). From the final table, 
the frequencies of publications in each category can 
be calculated. 

Mapping data in graphs is a useful aid for the 
reader to understand the analysis. Visualization 
alternatives could be found in statistics, HCI and
information visualization fields. 

We began by categorizing all articles based on their
abstracts into four categories. We determined whether
the article in question has contributed to the Big
Data field itself in terms of either volume, variety or
velocity. If the article simply applies one or more
Big Data techniques in a case, we identified whether
this is a Big Data experiment that has contributed to
the field itself by proving that “doing X is possible with
these techniques” to a reasonable degree or if this has
little effect on the field. In addition, we categorized the
contribution according to the empirical methods used:

1. Volume: Describes improvements and progress
within technologies and methods for handling
increases to the volume of data. E.g. optimizing
analysis methods to reduce runtime ($O(N^{a}) \rightarrow
O(N^{b})$ with $b < a$), thus enabling users to handle a
larger volume of data, but not approaching
stream/real-time speed (velocity), or improved methods
for handling storage and transfer of Big Data.

2. Variety: Describes improvements and progress
within technologies for handling variety of data. E.g.
new methods for classification that exploit very
large amounts of data.

3. Velocity: Describes improvements and progress
within technologies for coping with the
speed of incoming data. E.g. decreasing
turnaround/response time for analysis, approaching
real-time/stream analysis or completion time
guarantees. Technologies and methods for handling
a fire hose of incoming data^, one-pass algorithms
for computing approximate statistics/analytics [109].

4. Application area: Describes Big Data technology
applied to different application areas; not innovating
through new Big Data technology but through applying
Big Data technology to new areas.
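To make the notion of a one-pass algorithm under Velocity concrete, the sketch below keeps a running mean and variance of a stream without storing it. It uses Welford's update rule, which is a standard textbook technique chosen here purely for illustration; it is not taken from any of the surveyed publications, and unlike the approximate methods mentioned above it computes these two statistics exactly.

```python
class RunningStats:
    """One-pass (streaming) mean and population variance via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one value from the stream in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
print(stats.mean, stats.variance)
```

Each incoming value is inspected once and then discarded, which is the property that makes such algorithms suitable for a continuous stream of data.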


3 Findings from Data 

This section describes the outcomes of the steps in the
method described in Section 2. The results map
directly to the stages of the systematic mapping process
shown in Figure 1.

This section is structured as follows.
Subsection 3.1 describes the result of the definition of the
research question, conducting the search and the screening
of papers. Subsection 3.2 describes the result of
“Keywording using abstracts”. Subsection 3.3 describes
the result of “Data Extraction and Mapping Process”.

The method used can be summarized in three main
stages, as shown in Figure 2.

The outcome of stage 3 is described in Section 3.3.

^ See http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext


3.1 Review scope, all papers and relevant papers 

Review scope: The goal of this study is to understand
the state of the art within the field of Big Data. We
aim to identify past and current trends. A secondary
goal is to identify the forums that publish research in
the field of study. Our research questions reflect these
goals.

All papers: The query as described in
Subsection 2.2 was sent to Elsevier’s Scopus on
February 12th 2015, and resulted in 1778 publications.
However, some of the publications had duplicate entries
in the Scopus database, typically registered in two
different years or in two different publications. These
duplicate entries were removed and the final number of
unique publications is 1749. In Figure 4 we have listed
the distribution of publications per year.

Relevant papers: In Figure 3 we show the number
of included papers per year. We also present which of
the journals were most prominent in Table 2.



Fig. 3: Distribution of the included journal articles 
according to published year. 
Fig. 4: Distribution of the journal articles that contain
“Big Data” or “Big Datum” in title, abstract or
keywords according to publication year. This includes
old uses of the term “Big Data”, as can be seen by the
publication dated 1957.


Publication source                                      No.
Lecture Notes in Computer Science                        13
Proceedings of the VLDB Endowment                         8
Future Generation Computer Systems                        7
IEEE Transactions on Emerging Topics in Computing         6
Distributed and Parallel Databases                        5
IEEE Network                                              4
Expert Systems with Applications                          4
IEEE Transactions on Knowledge and Data Engineering       4
Journal of Supercomputing                                 4
Knowledge and Information Systems                         4

Table 2: Publication sources (labeled as journals by
Scopus) by number of included publications.



Fig. 5: A flowchart describing the classification scheme 
applied in this review process. 


3.2 Classification scheme

First we produced our primary corpus by applying
the exclusion criteria (see Table 1) to the initial
population of papers described in the previous section.
When reading the title and abstract we first checked
if the paper was affected by any of our exclusion
parameters. After this check was passed, keywording
was applied in the sense that main contributions were
highlighted in the metadata. This process is outlined
in Figure 5. Some articles turned out to be application
area descriptions rather than contributions to any of
the V’s. These were classified accordingly.

3.3 Systematic map

Through our classification we mapped the publications
onto four categories, Volume, Velocity, Variety and
Application area, according to what they contributed
within the field of Big Data. Table 4 shows the mapping
results according to publication year. The publications
are also mapped onto a Venn diagram in Figure 6,
showing the number of publications in each V and
their intersections. The numbers in Figure 6 correspond
to Table 6, which also includes the references to the
publications. Application areas are listed in Table 10.

[Figure 6. Venn diagram with the regions Volume, Variety and
Velocity and their intersections.]

Fig. 6: Venn diagram of the mapped studies.

3.4 Empirical Methods 

In Table 3 we provide an overview of the methods
that we identified as indicating that a contribution has
been evaluated empirically. In the cross matrix we see that
some contributions use several methods in order to prove the
value of their work.


Reference Year   Variety   Velocity   Volume
2009                   1          0        1
2011                   0          0        2
2012                   1          3        9
2013                  19          8       39
2014                  28         12       54
Total                 49         23      105

Table 4: Three V’s according to publishing year.
Method       Bench.  Case study  Demon.  Eval.  Exp.  Vali.  Impl.  Model  Simul.  Verify
Benchmark        14           0       4      0     8      1      4      0       0       1
Case study        0           6       1      1     0      0      0      0       0       0
Demonstrate       4           1      48     10    32      3      2      3       2       1
Evaluate          0           1      10     35    14      1      4      3       1       1
Experiment        8           0      32     14   108      7     13      4       3       4
Implement         4           0       2      4    13      1     22      1       1       1
Model             0           0       3      3     4      4      1     23       1       0
Simulation        0           0       2      1     3      1      1      1      11       0
Validate          1           0       3      1     7     12      1      4       1       0
Verify            1           0       1      1     4      0      1      0       0       8
Total            32           8     106     70   193     30     49     39      20      16

Table 3: Cross matrix of empirical methods, showing which empirical method keyword was used how many times,
and their co-occurrence. The top row is an abbreviated version of the first column.
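A cross matrix of this kind can be derived mechanically once every paper has been tagged with its method keywords. The sketch below assumes that each paper is represented as a set of keywords and that the diagonal counts the papers using a keyword at all; the toy data is invented and does not reproduce the numbers in Table 3.

```python
from collections import Counter
from itertools import product


def cooccurrence(papers):
    """Count, for every ordered keyword pair, the papers containing both.

    The pair (k, k) counts how many papers use keyword k at all.
    """
    counts = Counter()
    for keywords in papers:
        unique = set(keywords)
        for a, b in product(unique, repeat=2):
            counts[(a, b)] += 1
    return counts


# Invented example: four papers tagged with empirical method keywords.
tagged_papers = [
    {"experiment", "benchmark"},
    {"experiment", "evaluate"},
    {"experiment"},
    {"case study", "evaluate"},
]
matrix = cooccurrence(tagged_papers)
print(matrix[("experiment", "experiment")], matrix[("experiment", "benchmark")])
```

Because both orderings of a pair are counted, the resulting matrix is symmetric, as is the case for Table 3.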



Method        2014   2013   2012   2011
Benchmark        8      5      1      0
Case study       4      2      0      0
Demonstrate     25     20      2      1
Evaluate        20     10      4      1
Experiment      67     37      4      0
Implement       14      5      2      1
Model           10     12      1      0
Simulation       5      4      2      0
Validate         7      4      1      0
Verify           4      2      1      1
Total          164    101     18      4

Table 5: Empirical methods according to reference year.


4 Analysis 

Out of the total of 1749 assessed studies, Figure 3
depicts the distribution of the included studies sorted
by publication year. Starting with a total of 3 papers
in 2009-2011, we see that in 2012 there was an increase
in relevant publications and a near fourfold increase in
2013; this trend continues into 2014. So, we can state that
there is a clear upward trend in relevant publications.

Of the 210 included studies, 151 could be mapped
onto one or more of the V’s; the remaining 59 are
papers describing Big Data technologies applied to
application areas. In the Venn diagram we chose
to leave out the application areas and keep
the focus on the V’s. The most addressed area is
volume with 85 publications, followed by variety with
33 and velocity with 11. Among the combinations, research
addressing both variety and volume is most prominent with
11 included contributions, whereas volume and velocity are
combined 5 times. Finally, variety and velocity have two
included contributions, and only four studies (listed in
Table 6) mention all three of the main areas of Big Data.
From this we can conclude that the most mature areas in
terms of published results are Velocity and Volume.
We do want to note that many of the contributions
mention Hadoop and MapReduce as a base platform
while the focus of the content is directed towards velocity
and/or variety. This may indicate that the storage is
taken for granted when this is used.



Year          total   included       %
before 2009      31          0    0,00
2009             11          1    9,09
2010              8          0    0,00
2011             31          2    6,45
2012            187         17    9,09
2013            504         73   14,48
2014            974        118   12,11
2015              3          0    0,00
sum            1749        211   12,01

Table 7: Total number of journal publications per year.
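The yearly percentages in Table 7 are the included count divided by the total count for that year. A small sketch of that calculation, using the yearly counts from the table, is given below; the function name is illustrative, and the result uses a decimal point where the table uses a decimal comma.

```python
def inclusion_percentage(included_count, total_count):
    """Share of included papers in percent, rounded to two decimals."""
    if total_count == 0:
        return 0.0
    return round(100.0 * included_count / total_count, 2)


# Yearly counts taken from Table 7 (years with at least one included paper).
totals = {2009: 11, 2011: 31, 2012: 187, 2013: 504, 2014: 974}
included = {2009: 1, 2011: 2, 2012: 17, 2013: 73, 2014: 118}

shares = {year: inclusion_percentage(included[year], totals[year]) for year in totals}
print(shares)
```

For example, 73 included papers out of 504 in 2013 give 14.48 percent.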


Table 7 and the accompanying figure give an overview of the
total number of journal papers per year that we have
assessed, as well as the number of included papers. It
becomes very clear that the majority of publications
do not have empirical findings. These numbers are
also presented in Figure 3 (included papers per year)
and Figure 4 (total papers per year). The inclusion
percentage is graphed in Figure 7, and
per V and application area in Figure 8.
V

Publication count 

Publications 

Variety 

33 

l2JtllJ[12J[1611231 [26)1441153116^IZ2J|96| I101II102II114II113II111I 
[I25][ll9][l28][l29][l33] ITiil 114^ ITiSHT^ [T53l 
[l66][Tn][n6][T85][l89][2n] [233] 


Volume 

85 

3SIEIII1IIIZI 

4ni49ll58l[57ir 

issiiiozi ino] 
Tsn [Teol [T69 
inn [TMlfTM) 
2271 [2201 [222 

18] [19] [m [ID [271 [ig 12911321131113511371 [391 

3011631 [64][66]l67][7i][73] [71 [ST][82|[83][88] l93l IMl [971 

[m [ml [na [El [m [et] [na [m [m 

1T721 [T73l [1801 ITSll [T82l [T88l 1T931 ITOOl [192] [TM] 

200] [205] [20l [204] [207] [20l [212] [210] [211 |2T1[^ 
[^ |W1 [228] |23n [2321 

Velocity 

11 

idlEillO ED 

84111031 [164111831 [2091 [214] [219| 

Volume and Velocity 

5 

ill [81 [117) [1471 

2351 

Velocity And Variety 

2 

65II211I 

Variety and Volume 

11 

5)[45)[7)[70)[126)[163) I170II187II174II178112291 

Volume, Variety & 

4 

36||47||in6||115| 


Velocity 


Table 6: Number of publications classified to each V and the references to those publications.



Fig. 7: Percentage of empirical (included) studies per 
year. 



Fig. 8: Percentage of empirical (included) studies per 
year per mapped region: Volume, Velocity, Variety and 
Application Area. 

We have mapped the included studies to 4
categories: Variety, Velocity, Volume and Application
area. The latter means that a paper does contribute
empirically to Big Data by applying Big Data
technology to a field, however without advancing the
technology itself. Some papers address multiple V’s
and, if so, are mapped accordingly. Hence, the total
number of V’s is higher than the total number of
included papers.
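The effect of multiple mappings can be made explicit by summing, for each V, all regions of the Venn diagram that contain it. The sketch below uses the region counts listed in Table 6; the data structure and variable names are our own illustration.

```python
# Region counts as listed in Table 6 (each paper falls in exactly one region).
regions = {
    frozenset({"Volume"}): 85,
    frozenset({"Variety"}): 33,
    frozenset({"Velocity"}): 11,
    frozenset({"Volume", "Velocity"}): 5,
    frozenset({"Velocity", "Variety"}): 2,
    frozenset({"Variety", "Volume"}): 11,
    frozenset({"Volume", "Variety", "Velocity"}): 4,
}

# Number of distinct papers mapped to at least one V.
mapped_papers = sum(regions.values())

# Number of mappings per V: a paper in an intersection counts once for each V.
per_v = {
    v: sum(count for labels, count in regions.items() if v in labels)
    for v in ("Volume", "Variety", "Velocity")
}
total_mappings = sum(per_v.values())

print(mapped_papers, per_v, total_mappings)
```

The number of mappings exceeds the number of mapped papers exactly by the extra counts contributed by papers in the intersections.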

Table 8 gives an overview of the included papers
mapped according to the V’s and application areas. The
column “% change” indicates the change from the year
before and is meant to give an indication of whether
there is growth. We see that in the past 3 years there has
been an increase for all categories in absolute numbers.
Measured as percentage growth compared to
the previous year, however, we see a decline.




Fig. 9: Number of empirical (included) studies per 
year per mapped region: Volume, Velocity, Variety and 
Application Area. 

Table 9 gives an overview of the included papers
mapped according to the V’s and application areas.
The column “included %” indicates the percentage of
included papers. For example, in 2009 we included 1
paper which was mapped on both Variety and Volume
(also explaining why the total can be higher than 100%
per year). In the past 3 years, for Variety we see that
the percentage of included papers increased considerably
from 2012 to 2013, with a small decline in 2014.
The inclusion percentage of Volume is stable from 2012
to 2015. Velocity seems to drop from 2012 to 2013 and
stabilize in 2014, but the number of publications in 2012
is so low that this is not statistically significant. The
total number of publications, though, is still increasing.

59 of the included studies are not classified as
direct contributors to any of the three V’s; rather, each
Year   Variety   % change   Velocity   % change   Volume   % change   Application   % change
2009         1        N/A          0        N/A        1        N/A             0        N/A
2010         -        N/A          -        N/A        -        N/A             -        N/A
2011         0        N/A          0        N/A        2        N/A             0        N/A
2012         1        N/A          3        N/A        9        450             6        N/A
2013        19       1900          8     266,66       39     433,33            18        300
2014        28     147,36         12        150       54     138,46            35     194,44

Table 8: Included publications and their mapping to V and application area. The table also reflects the change
over time in percentage.
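The values in the “% change” columns appear to be the count of a year expressed as a percentage of the count of the year before, truncated to two decimals. This reading is inferred from the numbers in Table 8 and is not stated explicitly in the text; the function name below is likewise illustrative.

```python
def percent_of_previous(current, previous):
    """Current count as a percentage of the previous count, truncated to two decimals."""
    if not previous:
        return None  # no baseline to compare against, shown as N/A in the table
    # Integer arithmetic first, so the truncation is not affected by float rounding.
    return (10000 * current) // previous / 100


# Variety counts per year from Table 8.
variety = {2012: 1, 2013: 19, 2014: 28}
changes = {
    year: percent_of_previous(variety[year], variety[year - 1])
    for year in (2013, 2014)
}
print(changes)
```

Under this reading a value above 100 still means growth in absolute numbers: the 147,36 for Variety in 2014 corresponds to an increase of roughly 47 percent over 2013.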



Year   Variety   included %   Velocity   included %   Volume   included %   Appl.   included %
2009         1        100,0          0         0,00        1        100,0       0         0,00
2010         -          N/A          -          N/A        -          N/A       -          N/A
2011         0          0,0          0          0,0        2        100,0       0          0,0
2012         1         5,88          3        17,65        9        52,94       6        35,29
2013        19        26,03          8        10,95       39        53,42      18        24,66
2014        28        23,93         12        10,26       54        46,15      35        29,91

Table 9: Overview of the included publications mapped to V’s and the application areas.


publication describes Big Data technology applied to
different domains to such a degree that it can be viewed
as a contribution to Big Data as a field.


Application area                           No.
Social (network) analysis                    8
(Cyber) Security and privacy                 6
Visual analytics                             5
Predictive analytics                         4
Intelligent Transport Systems                4
Search engine/data exploration               3
Environmental monitoring and management      3
(Bio)Medical                                 3
Text Extraction                              3

Table 10: Publications grouped by application area.

In addition we have studies within
Recommendations [141][165], Cost reduction [?],
Image and video classification tasks [33], Stimulation
of learning experience [?], Clustering [?],
ATC [?], Telecom [?], Cloud [181], Kernel
spectral clustering [138][139] (however, these were
very close to being classified as applicable to Velocity
and Variability), Knowledge provision [142], Smart
Grid [?], Analytics [?], Space [?], Criminal
investigation [?], Marketing [?], retrieval of
learning objects [186], Bibliometrics [202], Service
operation [112] and recreational studies [195].

We cannot give a conclusive trend analysis based on
our study, though as we do have indications, we wanted
to see if these coincide with generally available trend
reports. Trend reports and predictions are abundant;
a quick Google search on “latest trends in Big Data”
returns millions of results. At the time of search (on
Google), the first hits were:

Gartner Predicts Three Big Data Trends for
Business Intelligence^

F1   By 2020, information will be used to reinvent,
     digitalize or eliminate 80% of business processes and
     products from a decade earlier.
F2   By 2017, more than 30% of enterprise access to
     broadly based Big Data will be via intermediary
     data broker services, serving context to business
     decisions.
F3   By 2017, more than 20% of customer-facing analytic
     deployments will provide product tracking
     information leveraging the IoT.

Table 11: Trends within Big Data predicted by Gartner
(published by Forbes)


Top Big Data and Analytics Trends for 2015⁶ 


5 http://www.forbes.com/sites/gartnergroup/2015/02/12/gartner-predicts-three-big-data-trends-for-business-intelligence/ 
6 http://www.zdnet.com/article/2015-interesting-big-data-and-analytics-trends/ 
Z1   More Magic 
Z2   Datafication 
Z3   Multipolar Analytics 
Z4   Fluid Analysis 
Z5   Community 
Z6   Analytic Ecosystems 
Z7   Data Privacy 


Table 12: Big Data and Analytics trends predicted in 
2015 by ZDnet. 


CIO’s 5 Big Data Technology Predictions for 2015⁷ 


C1   Data Agility Emerges as a Top Focus 
C2   Organizations Move from Data Lakes to 
     Processing Data Platforms 
C3   Self-Service Big Data Goes Mainstream 
C4   Hadoop Vendor Consolidation: New Business 
     Models Evolve 
C5   Enterprise Architects Separate the Big Hype 
     from Big Data 


Table 13: CIO’s Big Data Technology predictions for 
2015. 

We have enumerated the headings from the trend 
reports for easier referencing. 

We are aware that this is a limited subset of all 
available trend reports, though these should give at 
least a general impression and a basis for comparing 
our trend indication based on the literature study. We 
have omitted reports that require registration. 

Based on our literature study, we can indicate that 
Application is more on the rise than Variety, Velocity 
and Volume, thus Big Data technology is becoming 
more applied. 

The latter is reflected in F1, F2, C3, C4, C5 and Z2. 

Volume and Velocity do not seem to be reflected in 
these reports. 

In general terms, we can state that the reports 
agree that Big Data is becoming more mature and 
therefore more applied, and that analytics is the path 
to choose if you want to stay ahead of the state of the 
art. This is supported by Kambatla et al. [92]. 

4.1 Related mappings, surveys and reviews 

In addition to the above, we also identified studies that 
did not meet our inclusion criteria but do contribute 
to an overview of a part of the Big Data field. Below 
we summarize the type of contribution and their 
conclusions. 

7 http://www.cio.com/article/2862014/big-data/5-big-data-technology-predictions-for-2015.html 

Sakr et al. [162] provide a comprehensive survey 
of a family of approaches and mechanisms for 
large-scale data processing that have 
been implemented based on the original idea of the 
MapReduce framework and are currently gaining a 
lot of momentum in both research and industrial 
communities. They also cover a set of 
systems that have been implemented to provide 
declarative programming interfaces on top of the 
MapReduce framework. In addition, they review several 
large-scale data processing systems that resemble some 
of the ideas of the MapReduce framework for different 
purposes and application scenarios. 

Gorodov and Gubarev have done a review 
of methods for visualizing data and provided a 
classification of visualization methods in application to 
Big Data. 

Ruixan presents a bibliometrical analysis of 
Big Data research in China and summarizes research 
characteristics in order to study the in-depth 
development and the future development of Big Data. 
The study also provides reference information for studies 
related to Library and Information Studies. It 
concludes that research based on Big Data has taken 
shape, though most of these papers are in the theoretical 
stage of exploration and lack adequate practical support, 
and therefore recommends intensifying efforts based on 
theory and practice. 

Chen and Zhang [156] have done a comprehensive 
survey of Big Data technologies, techniques, challenges 
and applications. They offer a close view of Big Data 
applications, opportunities and challenges, as well as 
techniques that are currently adopted and used to solve 
Big Data problems. 

Jeong and Ghani have done a review of semantic 
technologies for Big Data and conclude that their 
analysis shows that there is a need to put more effort 
into proposing new approaches, and that tools must 
be created that support researchers and practitioners 
in realizing the true power of semantic computing and 
solving the crucial issues of Big Data. 

Gandomi and Haider [50] present a consolidated 
description of Big Data by integrating definitions from 
practitioners and academics. The paper’s primary focus 
is on the analytic methods used for Big Data. A 
particular distinguishing feature of this paper is its 
focus on analytics related to unstructured data, which 
according to these authors constitute 95% of Big Data. 

Wang and Krishnan present a review 
with the objective of providing an overview of the 
features of clinical Big Data. They describe a 
few commonly employed computational algorithms, 
statistical methods, and software tool kits for data 
manipulation and analysis, and discuss the challenges 
and limitations in this realm. 

Fernandez et al. [35] focus on systems for large-scale 
analytics based on the MapReduce scheme and Hadoop. 
They identify several libraries and software projects 
that have been developed for aiding practitioners 
to address this new programming model. They 
also analyze the advantages and disadvantages of 
MapReduce, in contrast to the classical solutions in this 
field. Finally, they present a number of programming 
frameworks that have been proposed as an alternative 
to MapReduce, developed under the premise of solving 
the shortcomings of this model in certain scenarios and 
platforms. 

Polato et al. have conducted a systematic 
literature review to assess research contributions to 
Apache Hadoop. The objective was to identify gaps, 
providing motivation for new research, and outline 
collaborations to Apache Hadoop and its ecosystem, 
classifying and quantifying the main topics addressed 
in the literature. 

Wu and Yamaguchi present a survey of 
Big Data in life sciences, Big Data related projects 
and Semantic Web technologies. The paper helps to 
understand the role of Semantic Web technologies in 
the Big Data era and how they provide a promising 
solution for Big Data in life sciences. 

Kambatla et al. [92] provide an overview of 
the state-of-the-art and focus on emerging trends 
to highlight the hardware, software, and application 
landscape of big-data analytics. 

Hashem et al. [55] have assessed the rise of big data 
in cloud computing. The definition, characteristics, and 
classification of big data along with some discussions 
on cloud computing are introduced. The relationship 
between big data and cloud computing, big data storage 
systems, and Hadoop technology are also discussed. 
Furthermore, research challenges are investigated, 
with focus on scalability, availability, data integrity, 
data transformation, data quality, data heterogeneity, 
privacy, legal and regulatory issues, and governance. 
Lastly, they give a summary of open research issues 
that require substantial research efforts. 

As for ongoing projects, the Byte project (EU FP7) 
is also investigating the research field of Big Data, and 
the Big Data Value Association is an initiative with the 
goal of providing the Big Data Value strategic research 
agenda (SRIA) and its regular updates, defining and 
monitoring the metrics of the cPPP and joining the 
European Commission in the cPPP partnership board. 


None of the studies above has thoroughly mapped 
the existing knowledge against the Big Data V 
concepts, nor assessed whether the contributions have 
produced empirical results. 

5 Discussion 

There are some limitations to this study. The first 
limitation is that we used a single source for our search. 
Scopus⁸ claims to be “the largest abstract and citation 
database of peer-reviewed literature: scientific journals, 
books and conference proceedings”, making it a valid 
choice. Scopus also returned a superset of the search 
results we got from trying the same query on IEEE 
Xplore, ACM Digital Library and Compendex. 

With regard to the mapping process, one could claim 
that both researchers should have read and discussed 
all abstracts. Instead, we did a pre-mapping in 
tandem and then split the work. Each researcher worked 
independently, excluding the clear outliers based on the 
exclusion criteria and including the papers that clearly 
met the inclusion criteria. In case of even the slightest 
doubt, we marked the publication and discussed it later. 

One can also argue that it is a limitation to restrict 
the mapping to the 3 V’s and application areas 
and not include the other “V’s” that appear in the 
papers. We argue that sticking to the original 3 V’s 
gives a much more concise overview than also including 
the non-standard V’s that emerge. The 3 basic V’s are well 
defined, whereas other V’s are open to interpretation. 

Another limitation is the definition of empirical 
work. We do not apply formal definitions; we only 
record the words used in abstracts that also 
describe empirical methods. If the authors claim 
that they have done an experiment and that the results 
have been evaluated, we noted this and did not read 
the full publication to verify the claim. We have not 
assessed the quality of the work carried out in detail, 
other than noting that the publication appeared in a 
peer-reviewed journal. 

A possible point of critique of this mapping study is 
that we only searched for publications reported as 
journal publications, as the fields of data science, 
analytics and databases also have a large number 
of high-quality contributions disseminated through 
top-level conferences. However, the researchers had to 
make a practical choice between not being able to perform 
this mapping study and reducing the number of studies 
to be included. Removing all studies not published in a 
journal seemed like a fair decision given this hard choice, 
as it does not discriminate across the different sub-fields 
that contribute to Big Data research. 

It can be argued that we did not find very clear trends 
in the analyzed data. Finally, the method for comparing 
our results with the non-scientific trend reports 
can be argued to be weak, as the selection 
of these reports was not as rigorous as the methods 
applied in selecting the studies. However, the trends do 
coincide. 

8 www.scopus.com 

6 Conclusion and Recommendations 

Typically a mapping study does not assess quality. 
However, as Big Data is and has been a very “hot topic” 
over the past years, the term appears in very many 
papers, including papers that have not contributed to 
Big Data research. Therefore, we chose to include only 
papers that have some form of empirical approach, in 
order to avoid analyzing papers that 
do not contribute to advancing the evidence 
base of Big Data research. A total of 210 articles were 
included, and 151 of these have been coded against one 
or more of the three “main V’s”. In addition, we have 
an overview of application areas (meaning Big Data 
technology has been applied, though without 
advancing one of the V’s). 

Research Question 1: Have mapping studies 
with similar goals to ours been carried out? 

Answer: At the time of search we found [53] and 
[55], which are labeled as reviews; however, they were 
not systematic. In [150] Park et al. use a systematic 
approach; the paper presents findings on the social 
networks of authors in co-authored papers within the 
Big Data field. 

Research Question 2: What is the share of 
studies that ground their results with empirical 
methods? 

Answer: We found that on average a bit less than 
10% of the retrieved publications include a form of 
empirical approach (for details, refer to the 
corresponding table). We also identified which type of 
empirical method was used; for details, see the 
corresponding tables. In the paper “The 
Future of Empirical Methods in Software Engineering 
Research”, Sjøberg et al. [168] state that “an average 
of the reviews indicates that about 20% of all papers 
report empirical studies”. This means that the use 
of empirical methods in Big Data research is below 
average. 
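
As a rough cross-check of the comparison with the 20% baseline, pooled shares can be computed from counts stated elsewhere in this paper (1778 assessed contributions, 210 included articles, 151 of them mapped to a V). This is only a sketch: the pooled shares are not the per-year average that the answer refers to, and the variable names are illustrative.

```python
# Counts as stated in the abstract and in the conclusion of this paper.
ASSESSED = 1778      # contributions retrieved and assessed
INCLUDED = 210       # articles included (some form of empirical approach)
MAPPED_TO_V = 151    # included articles coded against at least one V

# Baseline quoted from Sjøberg et al.: about 20% of papers report empirical studies.
BASELINE = 0.20

# Pooled shares over the whole corpus (assumption: not the per-year average).
shares = {
    "included": INCLUDED / ASSESSED,
    "mapped_to_v": MAPPED_TO_V / ASSESSED,
}

for label, share in shares.items():
    relation = "below" if share < BASELINE else "at or above"
    print(f"{label}: {share:.1%} ({relation} the {BASELINE:.0%} baseline)")
```

Under either count, the pooled share stays below the quoted baseline, which is in line with the conclusion drawn in the answer.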

Research Question 3: How are studies that 
provide empirical results grouped according to the 
“three Vs”? And what is the distribution of these 
different groups? 


Answer: We identified papers that could clearly 
be classified as contributing to the Big Data field 
within either an application area, or technology within 
Volume, Velocity or Variety. The analysis that followed 
revealed that Volume (105 papers) has received the 
most attention from researchers, whereas Variety (50 
papers) and Velocity (22 papers) amount to roughly half 
and a quarter of Volume, respectively, in number of 
publications. When one looks at the distribution per 
year, all V’s are still increasing in absolute numbers. 
For more information, see the corresponding section and 
figure. 
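
The proportions in this answer can be checked with a few lines of Python. This is a minimal sketch using the paper counts given above; the function name `ratio_to_volume` is illustrative and not part of the study. Since a paper can be coded against more than one V, the three counts add up to more than the 151 mapped papers.

```python
# Paper counts per V as reported in the answer to Research Question 3.
papers_per_v = {"Volume": 105, "Variety": 50, "Velocity": 22}


def ratio_to_volume(counts):
    """Return each V's paper count as a fraction of the Volume count."""
    volume = counts["Volume"]
    return {name: n / volume for name, n in counts.items()}


ratios = ratio_to_volume(papers_per_v)
for name, ratio in ratios.items():
    print(f"{name}: {papers_per_v[name]} papers, {ratio:.2f} of Volume")
```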

Research Question 4: What are the application 
areas of Big Data and how are they distributed? 

Answer: Big Data is used within many different application 
areas; we identified 65 papers describing the use of Big 
Data technology within an application area. We can see 
an increase in papers addressing an application area 
over time. For more details, see the corresponding section. 

Research Question 5: Which publication outlets 
are most prominent? 

Answer: From the studies retrieved by our search, 
limited to contributions that are classified as journals 
by Scopus, we found that Lecture Notes in Computer 
Science is the most prominent channel featured 
in our selected papers, followed by Proceedings of the 
VLDB Endowment and Future Generation Computer 
Systems. For the full list, see the corresponding table. 

Research Question 6: Can we identify any 
trends within Empirical Big Data Research? 

Answer: Based on our literature study, we can indicate 
that Application is more on the rise than Variety, 
Velocity and Volume, thus Big Data technology is 
becoming more applied. Referring to the analysis in 
section 4, we can, in general terms, state that Big Data 
is becoming more mature and therefore more applied, 
and that analytics is the path to choose if you want to 
stay ahead of the state of the art. 

Recommendations: The share of publications 
containing empirical results is well below the average 
compared to computer science research as a whole. 
In order to mature the research on Big Data, 
we recommend both using the evidence base of 
existing empirical studies in Big Data and 
applying empirical methods to strengthen 
the confidence in the reported results. As seen in Tables 
7 and 9, all of the V’s seem to be stable in their share 
of publications within the Big Data field. The number of 
publications on the application of Big Data technologies 
is rising; a natural explanation for this is that the Big 
Data field and its technologies have matured enough for 
applications. The least addressed areas of Big Data are 
Velocity and Variety. 